Setup

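# ngsReports parses the FastQC output; the remaining packages handle formatting and data wrangling
library(ngsReports)
library(magrittr)
library(scales)
library(pander)
library(tidyverse)
# Parse the zipped FastQC reports for the raw (pre-demultiplexing) libraries
rawFqc <- list.files("../0_rawData/FastQC/", pattern = "\\.zip$", full.names = TRUE) %>%
    getFastqcData()
# Parse the zipped FastQC reports for the demultiplexed samples
deMuxFqc <- list.files("../1_demux/FastQC/", pattern = "\\.zip$", full.names = TRUE) %>%
    getFastqcData()
# Sample metadata, used below to link each sample ID to its library
samples <- read_tsv("../0_rawData/samples.tsv")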

Base Qualities

Each library was inspected for overall base quality. Positions 3 & 4 in the R2 reads of libraries FCC21WPACXX-CHKPEI13070002_L6 and FCC21WPACXX-CHKPEI13070003_L7 showed clear problems with base qualities. This was most likely caused by insufficient nucleotide diversity in the original sequencing run, with too little phiX included to compensate, particularly as all R2 reads will terminate with the restriction site.

Base qualities before demultiplexing.

Read Totals

The first check after demultiplexing was to ensure that no reads were assigned to multiple individuals by sabre pe. Read totals before and after demultiplexing were then compared, and the recovery rate was >90% for every library. This process was only applied to the data from the 1996 and 2012 populations, as the collaborators provided the Turretfield samples already demultiplexed. These results indicate that demultiplexing was highly successful.
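How the check for reads assigned to multiple individuals was carried out is not shown in this report. As a minimal sketch on hypothetical toy data (the `assigned` table and its `ReadID` column are assumptions for illustration, not objects from this analysis), one way to express it is to count how many distinct samples each read identifier appears in:

```r
library(dplyr)
library(tibble)

# Hypothetical toy input: one row per read, recording the sample (ID) it was written to
assigned <- tibble(
    ReadID = c("read1", "read2", "read3", "read4"),
    ID = c("sampleA", "sampleA", "sampleB", "sampleB")
)

# Count the distinct samples per read and keep any read seen in more than one sample
multiAssigned <- assigned %>%
    distinct(ReadID, ID) %>%
    count(ReadID, name = "nSamples") %>%
    filter(nSamples > 1)

# Zero rows means that no read was assigned to more than one individual
nrow(multiAssigned)
```

With real data, the read identifiers would need to be extracted from the demultiplexed fastq files first, which may be memory-intensive for libraries of this size.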

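# Raw totals: R1 and R2 of a library hold the same number of reads, so strip the
# read-number suffix and keep one row per library
rawReadTotals <- readTotals(rawFqc) %>%
    mutate(Library = str_remove_all(Filename, "_[12]\\.fq\\.gz$")) %>%
    distinct(Library, Total_Sequences)
# Demultiplexed totals: collapse to one row per sample, attach the source library
# from the sample sheet, then sum the reads recovered for each library
deMuxReadTotals <- readTotals(deMuxFqc) %>%
    # The leading '.' matches whichever separator precedes the read number
    mutate(ID = str_remove_all(Filename, ".[12]\\.fq\\.gz$")) %>%
    distinct(ID, Total_Sequences) %>%
    left_join(samples, by = "ID") %>%
    group_by(Library) %>%
    summarise(DeMultiplexed = sum(Total_Sequences)) %>%
    # Drop samples with no library recorded in the sample sheet
    filter(!is.na(Library))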
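The suffix-stripping step above can be checked on a few file names. This is only a sketch: the file names below are hypothetical, built from a library name in this report, the sample name gc2709 and the suffixes implied by the patterns. Escaping the dots and anchoring with `$` is a slightly stricter variant, since an unescaped `.` matches any character.

```r
library(stringr)

# Hypothetical file names; the real names are whatever FastQC recorded
rawFiles <- c(
    "FCC21WPACXX-CHKPEI13070002_L6_1.fq.gz",
    "FCC21WPACXX-CHKPEI13070002_L6_2.fq.gz"
)
deMuxFiles <- c("gc2709.1.fq.gz", "gc2709.2.fq.gz")

# Both reads of a pair collapse to a single library or sample identifier
libs <- unique(str_remove_all(rawFiles, "_[12]\\.fq\\.gz$"))
ids <- unique(str_remove_all(deMuxFiles, ".[12]\\.fq\\.gz$"))

libs
ids
```

Because both reads of a pair map to the same identifier and carry the same read total, the subsequent `distinct()` call leaves one row per library or sample.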
Recovery rate from demultiplexing the 1996 and 2012 samples

Library                         Total Sequences   DeMultiplexed   Recovery Rate
------------------------------  ----------------  --------------  --------------
FCC21WPACXX-CHKPEI13070002_L6   69,729,067        63,767,250      91.45%
FCC21WPACXX-CHKPEI13070003_L7   65,481,942        65,031,823      99.31%
FCC229TACXX-CHKPEI13070001_L3   56,981,567        56,588,142      99.31%
FCC2GPDACXX-CHKPEI13070004_L2   78,411,565        78,043,413      99.53%
FCC2GPDACXX-CHKPEI13070005_L3   78,445,394        77,182,810      98.39%
FCC2GPDACXX-CHKPEI13070006_L4   70,874,412        69,781,770      98.46%
FCC2GPDACXX-CHKPEI13070007_L6   66,601,007        66,235,650      99.45%
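
The code that produced the table above is not shown. A minimal sketch of how the recovery rate could be derived, assuming the two summaries are joined on `Library` and the rate is simply demultiplexed reads divided by raw reads, is given below. The values are copied from the first two rows of the table so that the example is self-contained, and the `RecoveryRate` column name is chosen here for illustration.

```r
library(dplyr)
library(tibble)
library(scales)

# Values copied from the first two rows of the table above
rawReadTotals <- tibble(
    Library = c("FCC21WPACXX-CHKPEI13070002_L6", "FCC21WPACXX-CHKPEI13070003_L7"),
    Total_Sequences = c(69729067, 65481942)
)
deMuxReadTotals <- tibble(
    Library = c("FCC21WPACXX-CHKPEI13070002_L6", "FCC21WPACXX-CHKPEI13070003_L7"),
    DeMultiplexed = c(63767250, 65031823)
)

# Join the two summaries and express recovered reads as a proportion of raw reads
recovery <- rawReadTotals %>%
    left_join(deMuxReadTotals, by = "Library") %>%
    mutate(RecoveryRate = DeMultiplexed / Total_Sequences)

# Format the counts with thousands separators and the rate as a percentage
recovery %>%
    mutate(
        across(c(Total_Sequences, DeMultiplexed), comma),
        RecoveryRate = percent(RecoveryRate, accuracy = 0.01)
    )
```

In the report itself the formatted data frame would presumably be passed to `pander()` for display, although that step is an assumption.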
Read totals for each of the 1996 & 2012 samples.

GC Content

GC content for all 1996 and 2012 samples.

Inspection of GC content showed that samples gc2709 and gc2700 had an exaggerated peak around 60%, whilst all other samples showed a broader spread across the range. Among the Turretfield samples, pt1125 showed an unexpected peak around 50%, suggesting that this sample may contain reads from a different species. This sample should be excluded from all further analysis.

GC content for all Turretfield samples.

Adapter Content

Adapter content was unexpectedly high for some samples, and trimming reads to 71 bp may be appropriate.

Adapter Content for all 1996 & 2012 samples.

Sequence Content

Sequence content showing the presence of the restriction site (RS) in both the Turretfield and R2 samples.

Conclusion